We propose to implement a simple Artificial Neural Network (ANN) from scratch using only the NumPy library. Although it is more efficient to use deep learning libraries such as TensorFlow or PyTorch, the goal here is to gain a better understanding of how ANNs work.

Architecture

We will implement an ANN with 3 input neurons and a single output neuron: our problem has 3 features and 1 binary output variable. The architecture is shown below.

Architecture of the neural network

For each variable, the layer number is shown in superscript and the item number in each variable is shown in subscript. For example, \(w^L_i\) is the \(i^{th}\) weight in layer \(L\).

Feedforward

We will derive the equations as in the previous posts. Please refer to them for more details. The activation of the predicted output \(\hat{y}\) can be written as follows,

\[\hat{y} = a^1_0 = \sigma(z^1_0)\]

From the network architecture, we get:

\[z^1_0 = a^0_0 w^0_0 + a^0_1 w^0_1 + a^0_2 w^0_2 + b^0\]
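
In vectorized form, this feedforward pass reduces to a dot product followed by the activation. Below is a minimal sketch of the idea using placeholder arrays with the shapes we will use later; the real `X`, `weights` and `bias` are defined in the implementation section further down.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Placeholder shapes only: 5 observations with 3 features each
X = np.zeros((5, 3))            # input activations a^0
weights = np.zeros((3, 1))      # w^0_0, w^0_1, w^0_2
bias = 0.0                      # b^0

z = np.dot(X, weights) + bias   # z^1_0 for every observation, shape (5, 1)
y_hat = sigmoid(z)              # a^1_0 = sigma(z^1_0)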

Backpropagation

The MSE cost function is defined as follows,

\[J = \frac{1}{m}\sum_{i=1}^{m}(a^1_{0i}-y_i)^2\]

Let’s apply Gradient Descent to the first weight \(w^0_0\).

\[w^0_0 := w^0_0 - \alpha \frac{\partial J}{\partial w^0_0}\]

The chain rule now becomes,

\[\frac{\partial J}{\partial w^0_0} = \frac{\partial J}{\partial a^1_0} \frac{\partial a^1_0}{\partial z^1_0} \frac{\partial z^1_0}{\partial w^0_0}\]

Isolating each term, we have:

\[\begin{align*} \frac{\partial J}{\partial a^1_0} &= \frac{2}{m}(a^1_0-y) \\ \frac{\partial a^1_0}{\partial z^1_0} &= \sigma'(z^1_0) \\ \frac{\partial z^1_0}{\partial w^0_0} &= a^0_0 \end{align*}\]

If we repeat the same steps for the other 2 weights and the bias, we get:

\[\begin{align*} \frac{\partial J}{\partial w^0_0} &= \frac{2}{m}(a^1_0-y) \sigma'(z^1_0) a^0_0 \\ \frac{\partial J}{\partial w^0_1} &= \frac{2}{m}(a^1_0-y) \sigma'(z^1_0) a^0_1 \\ \frac{\partial J}{\partial w^0_2} &= \frac{2}{m}(a^1_0-y) \sigma'(z^1_0) a^0_2 \\ \frac{\partial J}{\partial b^0} &= \frac{2}{m}(a^1_0-y) \sigma'(z^1_0) \end{align*}\]
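
Stacking the observations row-wise, these four gradients can be computed with a couple of vectorized NumPy operations: the three weight gradients collapse into a single matrix product with the transposed inputs, and the bias gradient is the sum of the per-observation terms. A minimal sketch, assuming the arrays `X`, `y`, `weights`, `bias`, the number of observations `m`, and the sigmoid helpers defined in the implementation below:

z = np.dot(X, weights) + bias              # feedforward, shape (m, 1)
a = sigmoid(z)                             # a^1_0 for every observation

delta = (2/m) * (a - y) * sigmoid_der(z)   # common factor dJ/da^1_0 * da^1_0/dz^1_0

grad_w = np.dot(X.T, delta)                # dJ/dw^0_0, dJ/dw^0_1, dJ/dw^0_2 stacked in a (3, 1) vector
grad_b = np.sum(delta)                     # dJ/db^0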

We are now ready to implement!

NumPy implementation

The implementation is inspired by this article.

The problem

Before tackling the implementation itself, we need to define a problem to solve. Let’s build a toy dataset for a simple classification problem. Suppose we have some information about the obesity, smoking habits, and exercise habits of five people. We also know whether these people are diabetic or not. We can encode this information as follows:

Person      Smoking   Obesity   Exercise   Diabetic
Person 1    0         1         0          1
Person 2    0         0         1          0
Person 3    1         0         0          0
Person 4    1         1         0          1
Person 5    1         1         1          1

In the above table, we have five columns (Person, Smoking, Obesity, Exercise, and Diabetic) and five rows, i.e. 5 observations. Here 1 refers to true and 0 refers to false. For instance, the first person has values of 0, 1, 0, which means that the person doesn’t smoke, is obese, and doesn’t exercise. This person is also diabetic.

It is evident from the dataset that a person’s obesity is indicative of them being diabetic. Our task is to create a neural network that can predict whether an unknown person is diabetic given their smoking, obesity, and exercise habits. This is a supervised learning problem: we are given inputs and the corresponding correct outputs, and our task is to find the mapping between them.

The code

We will base our implementation on the neural network architecture described above. We start by importing some libraries and defining the Sigmoid function and its derivative.

from matplotlib import pyplot as plt
import numpy as np


def sigmoid(x):
    # Sigmoid activation function
    return 1/(1+np.exp(-x))

def sigmoid_der(x):
    # Derivative of the sigmoid function
    return sigmoid(x) * (1-sigmoid(x))
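
As a quick sanity check, we can evaluate both functions at a few points; for instance, the sigmoid of 0 should be 0.5 and its derivative at 0 should be 0.25.

print(sigmoid(np.array([-1.0, 0.0, 1.0])))      # approx. [0.269, 0.5, 0.731]
print(sigmoid_der(np.array([-1.0, 0.0, 1.0])))  # approx. [0.197, 0.25, 0.197]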

We then define our data set based on the problem described above.

X = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0], [1, 1, 0], [1, 1, 1]])  # features: smoking, obesity, exercise
y = np.array([[1, 0, 0, 1, 1]])  # labels: diabetic (1) or not (0)
y = y.reshape(5, 1)

We initialise the weights and the bias randomly, and we define the hyperparameters of the model.

np.random.seed(42)  
weights = np.random.rand(3, 1)  
bias = np.random.rand(1)  
alpha = 0.05        # learning rate
nb_epoch = 20000
H = np.zeros((nb_epoch, 6))  # history
m = len(X)   # number of observations

During the training phase, we perform feedforward and backpropagation steps using the MSE cost function, the chain rule and the gradient descent algorithm.

for epoch in range(nb_epoch):  

    # FEEDFORWARD
    z = np.dot(X, weights) + bias  # (5x1)
    a = sigmoid(z)  # (5x1)

    # BACKPROPAGATION
    # 1. cost function: MSE (per-observation contributions, summed when logging)
    J = (1/m) * (a - y)**2   # (5x1)

    # 2. weights
    dJ_da = (2/m)*(a-y) 
    da_dz = sigmoid_der(z)
    dz_dw = X.T

    gradient_w = np.dot(dz_dw, da_dz*dJ_da)  # chain rule 
    weights -= alpha*gradient_w               # gradient descent

    # 3. bias
    gradient_b = da_dz*dJ_da             # chain rule
    bias -= alpha*np.sum(gradient_b)     # gradient descent (sum over the 5 observations)

    # Record history for plotting
    H[epoch, 0] = epoch
    H[epoch, 1] = J.sum()
    H[epoch, 2:5] = np.ravel(weights)
    H[epoch, 5] = bias.item()

    print("Epoch {}/{} | cost function: {}".format(epoch, nb_epoch, J.sum()))

We can then plot the evolution of the cost function, weights and bias with the number of iterations (epoch).

plt.plot(H[:, 0], H[:, 1])
plt.xlabel('Epoch #')
plt.ylabel('Training error')
plt.savefig('plots/1_J_vs_epoch.png')
plt.show()

plt.plot(H[:, 0], H[:, 2], label='$w^0_0$')
plt.plot(H[:, 0], H[:, 3], label='$w^0_1$')
plt.plot(H[:, 0], H[:, 4], label='$w^0_2$')
plt.xlabel('Epoch #')
plt.ylabel('Weights')
plt.legend()
plt.savefig('plots/1_weights_vs_epoch.png')
plt.show()

plt.plot(H[:, 0], H[:, 5])
plt.xlabel('Epoch #')
plt.ylabel('Bias')
plt.savefig('plots/1_bias_vs_epoch.png')
plt.show()

MSE vs epoch

The training error (MSE) keeps decreasing with the number of iterations, which is a good sign.

Weights vs epoch

We can also notice that the weight \(w^0_1\) becomes predominant after many iterations. This is because the second feature (obesity) is very highly correlated with the output variable (diabetic).
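
One way to confirm this is to print the learned parameters at the end of training; the second weight should be noticeably larger than the other two.

print("weights:", weights.ravel())
print("bias:", bias.item())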

Bias vs epoch

Finally, we test our neural network on some unseen examples. Let’s consider two “new” people who did not appear in the training set. For example, a smoker who is not obese and who exercises (encoded as [1, 0, 1]) is classified as not diabetic.

example1 = np.array([1, 0, 1])  
result1 = sigmoid(np.dot(example1, weights) + bias)  
print(result1.round())

>> [0.]

However, an obese non-smoker who exercises ([0, 1, 1]) is predicted to be diabetic.

example2 = np.array([0, 1, 1])  
result2 = sigmoid(np.dot(example2, weights) + bias)  
print(result2.round())

>> [1.]
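
As an additional sanity check, we can also run the trained network on the training set itself and compare the rounded predictions with the true labels.

train_preds = sigmoid(np.dot(X, weights) + bias).round()
print(np.hstack([train_preds, y]))   # predicted label next to the true label for each person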

You can find the code on my GitHub.

Conclusion

This article described the theory of a very simple neural network with one input layer and one output layer. It was implemented in plain NumPy and applied to a simple classification problem with 3 features and 5 observations.

This type of neural network is called a Single-Layer Perceptron, and it is capable of classifying linearly separable data. However, most real-world problems require identifying non-linear decision boundaries. In the next article, I will describe how a Multi-Layer Perceptron can be used to estimate such boundaries.
